In One Corner…
As we saw in our Faster Group Statistics post, data.table is the heavyweight champ in the field. Its gforce functions and fast grouping put it head and shoulders above all challengers. And yet, here we are, about to throw our hat in the ring with nothing but base R functionality. Are we out of our minds?
Obviously I wouldn’t be writing this if I didn’t think we had a chance, although the only reason we have a chance is because data.table generously contributed its fast radix sort to R 3.3.0. Perhaps it is ungracious of us to use it to try to beat data.table, but where’s the fun in being gracious?
The Ring, and a Warmup
Ten million observations, ~one million groups, no holds barred:
RNGversion("3.5.2"); set.seed(42)
n <- 1e7
n.grp <- 1e6
grp <- sample(n.grp, n, replace=TRUE)
noise <- rep(c(.001, -.001), n/2) # more on this later
x <- runif(n) + noise
y <- runif(n) + noise # we'll use this later
Let’s do a warm-up run with a simple statistic. We use vapply/split instead of tapply as that will allow us to work with more complex statistics later. sys.time is a wrapper around system.time that runs the expression eleven times and returns the median timing. It is defined in the appendix.
sys.time({
grp.dat <- split(x, grp)
x.ref <- vapply(grp.dat, sum, 0)
})
user system elapsed
6.235 0.076 6.316
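For context, the tapply equivalent of the above would be a one-liner along these lines (shown for comparison only, not benchmarked here), though it does not extend as naturally to the multi-input statistics we compute later:
x.tapply <- tapply(x, grp, sum)   # same group sums, named by group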
Let’s repeat by ordering the data first because pixie dust:
sys.time({
o <- order(grp)
go <- grp[o]
xo <- x[o]
grp.dat <- split(xo, go)
xgo.sum <- vapply(grp.dat, sum, numeric(1))
})
user system elapsed
2.743 0.092 2.840
And now with data.table:
library(data.table)
DT <- data.table(grp, x)
setDTthreads(1) # turn off multi-threading
sys.time(x.dt <- DT[, sum(x), keyby=grp][[2]])
user system elapsed
0.941 0.030 0.973
Ouch. Even without multithreading, data.table crushes the ordered split/vapply approach. We use one thread for more stable and comparable results; we’ll show some multi-threaded benchmarks at the end.
Interlude - Better Living Through Sorted Data
Pixie dust is awesome, but there is an even more important reason to like sorted data: it opens up possibilities for better algorithms. unique makes for a good illustration. Let’s start with a simple run:
sys.time(u0 <- unique(grp))
user system elapsed
1.223 0.055 1.290
We are ~40% faster if we order first, including the time to order[1]:
sys.time({
o <- order(grp)
go <- grp[o]
u1 <- unique(go)
})
user system elapsed
0.884 0.049 0.937
identical(sort(u0), u1)
[1] TRUE
The interesting thing is that once the data is ordered we don’t even need to use unique and its inefficient hash table. For example, in:
(go.hd <- head(go, 30))
[1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4
We just need to find the positions where the values change to find the unique values, which we can do with diff, or the slightly-faster-for-this-purpose[2]:
go[-1L] != go[-length(go)]
It is clear from looking at the vectors side by side that the group changes exactly where they are not equal (showing first 30):
go[-1L] : 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4
go[-length(go)] : 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4
To get the unique values we can just use the above to index into go, though we must offset by one element:
sys.time({
o <- order(grp)
go <- grp[o]
u2 <- go[c(TRUE, go[-1L] != go[-length(go)])]
})
user system elapsed
0.652 0.017 0.672
identical(u1, u2)
[1] TRUE
Same result, but twice the speed of the original, again including the time to order. Most of the time is spent ordering as we can see by how quickly we pick out the unique values once the data is ordered:
sys.time(u2 <- go[c(TRUE, go[-1L] != go[-length(go)])])
user system elapsed
0.135 0.016 0.151
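For reference, the diff-based variant mentioned above is also a one-liner; it produces the same result, though as noted it is slightly slower for this purpose (and, like the comparison version, it assumes numeric group ids with no NAs):
u2.diff <- go[c(TRUE, diff(go) != 0)]  # runs start where consecutive differences are non-zero
identical(u2, u2.diff)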
The main point I’m trying to make here is that it is a big deal that order is fast enough that we can switch the algorithms we use downstream and get an even bigger performance improvement.
A big thank you to team data.table for sharing the pixie dust.
Group Sums
It’s cute that we can use our newfound power to find unique values, but can we do something more sophisticated? It turns out we can. John Mount shows how to compute group sums using cumsum on group-ordered data[3]. With a little work we can generalize it.
The concept is to order by group, compute the cumulative sum, pull out the last value for each group, and take their differences. This is the data we will use to illustrate:
g1
[1] 1 2 3 2 3 3 1
x1
[1] 0.915 0.937 0.286 0.830 0.642 0.519 0.737
The first three steps are obvious:
ord <- order(g1)
go <- g1[ord]
xo <- x1[ord]
xc <- cumsum(xo)
Picking the last value from each group is a little harder, but we can do so with the help of base::rle. rle returns the lengths of repeated-value sequences within a vector. In a vector of ordered group ids, we can use it to compute the lengths of each group:
go
[1] 1 1 2 2 3 3 3
grle <- rle(go)
(gn <- grle[['lengths']])
[1] 2 2 3
This tells us the first group has two elements, the second also two, and the last three. We can translate this into indices of the original vector with cumsum, and use those to pull out the relevant values from the cumulative sum of the x values:
(gnc <- cumsum(gn))
[1] 2 4 7
(xc.last <- xc[gnc])
[1] 1.65 3.42 4.87
To finish we just take the differences:
diff(c(0, xc.last))
[1] 1.65 1.77 1.45
I wrapped the whole thing into the group_sum function you can see in the appendix:
group_sum(x1, g1)
1 2 3
1.65 1.77 1.45
Every step of group_sum is internally vectorized[4], so the function is fast. We demonstrate here with the original 10MM data set:
sys.time(x.grpsum <- group_sum(x, grp))
user system elapsed
1.098 0.244 1.344
all.equal(x.grpsum, c(x.ref), check.attributes=FALSE)
[1] TRUE
data.table is still faster, but we’re within striking distance. Besides, the real fight is up ahead.
Before we go on, I should note that R provides base::rowsum, not to be confused with its better-known cousin base::rowSums. And why would you confuse them? Clearly the capitalization and pluralization provide stark semantic clues that distinguish them like dawn does night and day. Anyhow…, rowsum is the only base R function I know of that computes group statistics with arbitrary group sizes directly in compiled code. If all we’re trying to do is group sums, then we’re better off using that instead of our homegrown cumsum version:
sys.time({
o <- order(grp)
x.rs <- rowsum(x[o], grp[o], reorder=FALSE)
})
user system elapsed
1.283 0.105 1.430
all.equal(x.grpsum, c(x.rs), check.attributes=FALSE)
[1] TRUE
A summary of timings so far:
vapply/split and rowsum use ordered inputs and include the time to order them. data.table is single-threaded, and group_sum2 is a version of group_sum that handles NAs and infinite values. Since performance is comparable for group_sum2, we will ignore the NA/Inf wrinkle going forward.
So You Think You Can Group-Stat?
Okay, great, we can sum quickly in base R. One measly stat. What good is that if we want to compute something more complex like the slope of a bivariate regression, as we did in our prior post? As a refresher, this is what the calculation looks like:
slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
The R equivalent is[5]:
slope <- function(x, y) {
  x_ux <- x - mean.default(x)
  y_uy <- y - mean.default(y)
  sum(x_ux * y_uy) / sum(x_ux ^ 2)
}
We can see that sum shows up explicitly, and somewhat implicitly via mean[6]. There are many statistics that essentially boil down to adding things together, so we can use group_sum (or in this case its simpler form .group_sum_int) as the wedge to breach the barrier to fast grouped statistics in R:
.group_sum_int <- function(x, last.in.group) {
  xgc <- cumsum(x)[last.in.group]
  diff(c(0, xgc))
}
group_slope <- function(x, y, grp) {
  ## order inputs by group
  o <- order(grp)
  go <- grp[o]
  xo <- x[o]
  yo <- y[o]
  ## group sizes and group indices
  grle <- rle(go)
  gn <- grle[['lengths']]
  gnc <- cumsum(gn)              # Last index in each group
  gi <- rep(seq_along(gn), gn)   # Group recycle indices
  ## compute mean(x) and mean(y), and recycle them
  ## to each element of `x` and `y`:
  sx <- .group_sum_int(xo, gnc)
  ux <- (sx/gn)[gi]
  sy <- .group_sum_int(yo, gnc)
  uy <- (sy/gn)[gi]
  ## (x - mean(x)) and (y - mean(y))
  x_ux <- xo - ux
  y_uy <- yo - uy
  ## Slopes!
  x_ux.y_uy <- .group_sum_int(x_ux * y_uy, gnc)
  x_ux2 <- .group_sum_int(x_ux ^ 2, gnc)
  setNames(x_ux.y_uy / x_ux2, grle[['values']])
}
The non-obvious steps involve gn, gnc, and gi. As we saw earlier with group_sum, gn corresponds to how many elements there are in each group, and gnc to the index of the last element in each group. Let’s illustrate with some toy values:
(xo <- 2:6) # some values
[1] 2 3 4 5 6
(go <- c(3, 3, 5, 5, 5)) # their groups
[1] 3 3 5 5 5
(gn <- rle(go)[['lengths']]) # the size of the groups
[1] 2 3
(gnc <- cumsum(gn)) # index of last item in each group
[1] 2 5
Since these are the same quantities used by group_sum, we can use the simpler version .group_sum_int that takes the index of the last element in each group as an input:
(sx <- .group_sum_int(xo, gnc)) # group sum
[1] 5 15
We re-use gnc four times throughout the calculation, which is a big deal because computing it is the slow step. With the group sums we can derive the group means:
(sx/gn) # mean of each group
[1] 2.5 5.0
But we need to compute x - mean(x) for every element, which means we need to recycle each group’s mean to each of its elements. This is what gi does:
(gi <- rep(seq_along(gn), gn))
[1] 1 1 2 2 2
cbind(x=xo, ux=(sx/gn)[gi], g=go) # cbind to show relationship b/w values
x ux g
[1,] 2 2.5 3
[2,] 3 2.5 3
[3,] 4 5.0 5
[4,] 5 5.0 5
[5,] 6 5.0 5
For each original x value, we have associated the corresponding group mean ux. We compute uy the same way as ux, and once we have those two values the rest of the calculation is straightforward.
While this is quite a bit of work, the results are remarkable:
sys.time(slope.gs <- group_slope(x, y, grp))
user system elapsed
1.794 0.486 2.312
Compare to the hand-optimized version of data.table from one of our earlier posts:
sys.time({
DT <- data.table(x, y, grp)
setkey(DT, grp)
DTsum <- DT[, .(ux=mean(x), uy=mean(y)), keyby=grp]
DT[DTsum, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
slope.dt <- DTsum[, .(grp, V1=x_ux.y_uy / x_ux2)]
})
user system elapsed
2.721 0.412 3.139
Oh snap, we’re ~30% faster than data.table! And this is the painstakingly optimized version that computes on groups directly in C code without the per-group R evaluation overhead. We’re ~3x faster than the straight “out-of-the-box” version DT[, slope(x, y), grp].
A summary of all the timings:
If I let data.table use both my cores it comes close to our timings (optim(mc)), and presumably would do a little better still with more cores, but a tie or a slight win for a multi-threaded process over a single-threaded one is a loss in my books.
More details for the benchmarks are in the appendix.
UPDATE: Michael Chirico points out that it is possible to reformulate the slope equation into a more favorable form, and under that form data.table is faster (although our methods are close). I’ll defer analysis of how generalizable this is to another post, but in the meantime you can see those benchmarks in the appendix.
Controversy
As we bask in the glory of this upset we notice a hubbub around the judges’ table. A representative of the Commission on Precise Statistics, gesticulating, points angrily at us. Oops. It turns out that our blazing fast benchmark hero is cutting some corners:
all.equal(slope.gs, slope.dt$V1, check.attributes=FALSE)
[1] "Mean relative difference: 0.0001161377"
cor(slope.dt$V1, slope.gs, use='complete.obs')
[1] 0.9999882
The answers are almost the same, but not exactly. Our cumsum approach is exhausting the precision available in double precision numerics. We could remedy this by using a rowsum-based group_slope, but that would be slower as we would not be able to re-use the group index data.
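To make the trade-off concrete, here is a rough sketch of what a rowsum-based group_slope could look like (my own illustration with a hypothetical name, not part of the benchmarks above). Because rowsum sums each group directly rather than through one long cumulative sum it sidesteps the precision issue, but every statistic now needs its own rowsum pass instead of re-using gnc:
group_slope_rs <- function(x, y, grp) {
  o <- order(grp)
  go <- grp[o]
  xo <- x[o]
  yo <- y[o]
  gn <- rle(go)[['lengths']]                    # group sizes
  gi <- rep(seq_along(gn), gn)                  # recycle group values to elements
  ux <- c(rowsum(xo, go, reorder=FALSE)) / gn   # group means of x
  uy <- c(rowsum(yo, go, reorder=FALSE)) / gn   # group means of y
  x_ux <- xo - ux[gi]
  y_uy <- yo - uy[gi]
  num <- rowsum(x_ux * y_uy, go, reorder=FALSE) # per-group numerator sums
  den <- rowsum(x_ux ^ 2, go, reorder=FALSE)    # per-group denominator sums
  setNames(c(num / den), rownames(num))
}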
Oh, so close. We put up a good fight, but CoPS is unforgiving and we are disqualified.
Conclusions
We learned how we can use ordered data to our advantage, and did something quite remarkable in the process: beat data.table at its own game, but for a technicality. Granted, this was for a more complex statistic. We will never be able to beat data.table for simple statistics with built-in gforce counterparts (e.g. sum, median, etc.), but as soon as we step away from those we have a chance, and even for those we are competitive[7].
In Part 3 of the Hydra Chronicles we will explore why we’re running into precision issues and whether we can redeem ourselves (hint: we can).
Appendix
Acknowledgments
- John Mount, for the initial cumsum idea.
- Michael Chirico, for the clever alternate formulation of slope, and for having the time and patience to remind me of the expected value forms and manipulations of variance and covariance.
- Dirk Eddelbuettel, for copy edit suggestions.
- Matt Dowle and the other data.table authors, for contributing the radix sort to R.
- Hadley Wickham and the ggplot2 authors, for ggplot2, with which I made the plots in this post.
Image Credits
- Rock-em, 2009, by Ariel Waldman, under CC BY-SA 2.0, cropped.
Updates
- 2019-06-10:
  - Slope reformulation.
  - Included missing sys.time definition.
  - Bad links.
- 2019-06-11: session info.
- 2019-06-12: fix covariance formula.
Session Info
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.2
sys.time
sys.time <- function(exp, reps=11) {
  res <- matrix(0, reps, 5)
  time.call <- quote(system.time({NULL}))
  time.call[[2]][[2]] <- substitute(exp)
  gc()
  for(i in seq_len(reps)) {
    res[i,] <- eval(time.call, parent.frame())
  }
  structure(res, class='proc_time2')
}
print.proc_time2 <- function(x, ...) {
  print(
    structure(
      x[order(x[,3]),][floor(nrow(x)/2),],
      names=c("user.self", "sys.self", "elapsed", "user.child", "sys.child"),
      class='proc_time'
    )
  )
}
Reusing Last Index
The key advantage group_slope has is that it can re-use gnc, the vector of indices to the last value in each group. Computing gnc is the expensive part of the cumsum group sum calculation:
o <- order(grp)
go <- grp[o]
xo <- x[o]
sys.time({
gn <- rle(go)[['lengths']]
gi <- rep(seq_along(gn), gn)
gnc <- cumsum(gn)
})
user system elapsed
0.398 0.134 0.535
Once we have gnc the group sum is blazing fast:
sys.time(.group_sum_int(xo, gnc))
user system elapsed
0.042 0.008 0.050
Other Slope Benchmarks
vapply
Normal:
sys.time({
id <- seq_along(grp)
id.split <- split(id, grp)
slope.ply <- vapply(id.split, function(id) slope(x[id], y[id]), 0)
})
user system elapsed
12.416 0.142 12.573
all.equal(slope.ply, c(slope.rs), check.attributes=FALSE)
[1] TRUE
Sorted version:
sys.time({
o <- order(grp)
go <- grp[o]
id <- seq_along(grp)[o]
id.split <- split(id, go)
slope.ply2 <- vapply(id.split, function(id) slope(x[id], y[id]), 0)
})
user system elapsed
8.233 0.105 8.351
all.equal(slope.ply2, c(slope.rs), check.attributes=FALSE)
[1] TRUE
data.table
Normal:
setDTthreads(1)
DT <- data.table(grp, x, y)
sys.time(DT[, slope(x, y), grp])
user system elapsed
6.509 0.066 6.627
Normal multi-thread:
setDTthreads(0)
DT <- data.table(grp, x, y)
sys.time(DT[, slope(x, y), grp])
user system elapsed
7.979 0.112 6.130
Optimized:
library(data.table)
setDTthreads(1)
sys.time({
DT <- data.table(grp, x, y)
setkey(DT, grp)
DTsum <- DT[, .(ux=mean(x), uy=mean(y)), keyby=grp]
DT[DTsum, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt2 <- DTsum[, .(grp, V1=x_ux.y_uy / x_ux2)]
})
user system elapsed
2.721 0.412 3.139
Optimized multi-core:
setDTthreads(0)
sys.time({
DT <- data.table(grp, x, y)
setkey(DT, grp)
DTsum <- DT[, .(ux=mean(x), uy=mean(y)), keyby=grp]
DT[DTsum, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt2 <- DTsum[, .(grp, V1=x_ux.y_uy / x_ux2)]
})
user system elapsed
5.332 0.842 2.412
Reformulated Slope
Special thanks to Michael Chirico for providing this alternative formulation of the slope calculation:
slope = (E[xy] − E[x]E[y]) / (E[x²] − E[x]²)
where we take E[...] to signify mean(...). See the Wikipedia pages for Variance and Covariance for the step-by-step simplifications of the expected value expressions.
A key feature of this formulation is that there is no interaction between grouped and ungrouped statistics, as there is in the original. This saves the costly merge step and results in a substantially faster calculation (single thread):
sys.time({
DT <- data.table(x, y, xy=x*y, x2=x^2, grp)
slope.dt.re <- DT[,
.(ux=mean(x), uy=mean(y), uxy=mean(xy), ux2=mean(x2)),
keyby=grp
][,
setNames((uxy - ux*uy)/(ux2 - ux^2), grp)
]
})
user system elapsed
1.377 0.126 1.507
But be careful, as there are precision issues here too, as warned on the Variance page:
This equation should not be used for computations using floating point arithmetic because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude. There exist numerically stable alternatives.
We observe this to a small extent by comparing to our vapply-based calculation:
quantile(slope.ply2 - slope.dt.re, na.rm=TRUE)
0% 25% 50% 75% 100%
-6.211681e-04 -4.996004e-16 0.000000e+00 4.996004e-16 1.651546e-06
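As a quick toy illustration of that warning (my own example, not from the original benchmarks): shift the data by a large constant and the E[x^2] - E[x]^2 form collapses, while the centered two-pass form stays near the true variance of a uniform variate (~1/12):
z <- runif(1e3) + 1e9        # values with a large common offset
mean(z^2) - mean(z)^2        # one-pass form: catastrophic cancellation
mean((z - mean(z))^2)        # two-pass form: ~0.083, as expected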
We can apply a similar reformulation to group_slope:
group_slope_re <- function(x, y, grp) {
  o <- order(grp)
  go <- grp[o]
  xo <- x[o]
  yo <- y[o]
  grle <- rle(go)
  gn <- grle[['lengths']]
  gnc <- cumsum(gn)  # Last index in each group
  ux <- .group_sum_int(xo, gnc)/gn
  uy <- .group_sum_int(yo, gnc)/gn
  uxy <- .group_sum_int(xo * yo, gnc)/gn
  ux2 <- .group_sum_int(xo^2, gnc)/gn
  setNames((uxy - ux * uy)/(ux2 - ux^2), grle[['values']])
}
sys.time(slope.gs.re <- group_slope_re(x, y, grp))
user system elapsed
1.548 0.399 1.957
In this case data.table flips the advantage:
Function Definitions
group_sum
group_sum <- function(x, grp) {
  ## Order groups and values
  ord <- order(grp)
  go <- grp[ord]
  xo <- x[ord]
  ## Last values
  grle <- rle(go)
  gnc <- cumsum(grle[['lengths']])
  xc <- cumsum(xo)
  xc.last <- xc[gnc]
  ## Take diffs and return
  gs <- diff(c(0, xc.last))
  setNames(gs, grle[['values']])
}
Cumulative Group Sum With NA and Inf
A correct implementation of the single-pass cumsum-based group_sum requires a bit of work to handle both NA and Inf values. Both of these need to be pulled out of the data ahead of the cumulative step, otherwise they would wreck all subsequent calculations. The rub is that they need to be re-injected into the results, and with Inf we need to account for groups computing to Inf, -Inf, and even NaN.
I implemented a version of group_sum that handles these for illustrative purposes. It is lightly tested, so you should not consider it to be generally robust. We ignore the possibility of NAs in grp, although that is something that should be handled too, as rle treats each NA value as distinct.
group_sum2 <- function(x, grp, na.rm=FALSE) {
  if(length(x) != length(grp)) stop("Unequal length args")
  if(!(is.atomic(x) && is.atomic(grp))) stop("Non-atomic args")
  if(anyNA(grp)) stop("NA vals not supported in `grp`")
  ord <- order(grp)
  grp.ord <- grp[ord]
  grp.rle <- rle(grp.ord)
  grp.rle.c <- cumsum(grp.rle[['lengths']])
  x.ord <- x[ord]
  # NA and Inf handling. Checking for Infs makes this ~5% slower, but it
  # doesn't seem worth adding special handling for cases w/o Infs
  has.na <- anyNA(x)
  if(has.na) {
    na.x <- which(is.na(x.ord))
    x.ord[na.x] <- 0
  } else na.x <- integer()
  inf.x <- which(is.infinite(x.ord))
  any.inf <- length(inf.x) > 0
  if(any.inf) {
    inf.vals <- x.ord[inf.x]
    x.ord[inf.x] <- 0
  }
  x.grp.c <- cumsum(x.ord)[grp.rle.c]
  x.grp.c[-1L] <- x.grp.c[-1L] - x.grp.c[-length(x.grp.c)]
  # Re-inject NAs and Infs as needed
  if(any.inf) {
    inf.grps <- findInterval(inf.x, grp.rle.c, left.open=TRUE) + 1L
    inf.rle <- rle(inf.grps)
    inf.res <- rep(Inf, length(inf.rle[['lengths']]))
    inf.neg <- inf.vals < 0
    # If more than one Inf val in group, need to make sure we don't have
    # Infs of different signs as those add up to NaN
    if(any(inf.long <- (inf.rle[['lengths']] > 1L))) {
      inf.pos.g <- group_sum2(!inf.neg, inf.grps)
      inf.neg.g <- group_sum2(inf.neg, inf.grps)
      inf.res[inf.neg.g > 0] <- -Inf
      inf.res[inf.pos.g & inf.neg.g] <- NaN
    } else {
      inf.res[inf.neg] <- -Inf
    }
    x.grp.c[inf.rle[['values']]] <- inf.res
  }
  if(!na.rm && has.na)
    x.grp.c[findInterval(na.x, grp.rle.c, left.open=TRUE) + 1L] <- NA
  structure(x.grp.c, groups=grp.rle[['values']], n=grp.rle[['lengths']])
}
sys.time(x.grpsum2 <- group_sum2(x, grp))
user system elapsed
1.147 0.323 1.479
all.equal(x.grpsum, x.grpsum2)
[1] TRUE
Footnotes
1. For more details on why the hash table based unique is affected by ordering, see the pixie dust post.
2. Additional logic will be required for handling NAs.
3. I did some light research but did not find other obvious uses of this method. Since this approach was not really practical until data.table’s radix sort was added to base R in version 3.3.0, it’s plausible it is somewhat rare. Send me links if there are other examples.
4. R code that carries out looped operations over vectors in compiled code rather than in R-level code.
5. This is a slightly modified version of the original from the prior post that is faster because it uses mean.default instead of mean.
6. mean(x) in R is not exactly equivalent to sum(x) / length(x) because mean has a two-pass precision improvement algorithm.
7. We leave it as an exercise to the reader to implement a fast group_median function.